Method, computer program and data processing system for data clustering

EPO - Munich
24
Jan. 2001

Field of invention

The present invention relates to the field of data clustering 
and in particular to clustering algorithms and quality 
determination.



Background of the invention 

Clustering of data is a data processing task in which clusters 
are identified in a structured set of raw data. Typically the 
raw data consists of a large set of records, each record having
the same or a similar format. Each field in a record can take 
any of a number of logical, categorical, or numerical values. 
Data clustering aims to group such records into clusters such 
that records belonging to the same cluster have a high degree of 
similarity. 

A variety of algorithms is known for data clustering. The K- 
means algorithm relies on the minimal sum of Euclidean distances 
to centers of clusters taking into consideration the number of 
clusters. The Kohonen-algorithm is based on a neural net and 
also uses Euclidean distances. IBM's demographic algorithm 
relies on the sum of internal similarities minus the sum of 
external similarities as a clustering criterion. Those and other 
clustering criteria are utilized in an iterative process of 
finding clusters. 

A common disadvantage of such prior art clustering algorithms is 
that different clustering algorithms applied to the same set of 
data may deliver largely different results. Even if the same 
algorithm is applied to the same set of data using a different 
set of parameters as a starting condition, a different result is
likely to occur. In the prior art no objective criterion exists 
to compare the results of such clustering operations. 

One field of application of data clustering is data mining. From 
US 6,112,194 a method for data mining including a feedback 
mechanism for monitoring performance of mining tasks is known. A 
user selected mining technique type is received for the data 
mining operation. A quality measure type is identified for the 
user selected mining technique type. The user selected mining 
technique type for the data mining operation is processed and a 
quality indicator is measured using the quality measure type. 
The measured quality indication is displayed while processing 
the user selected mining technique type for the data mining 
operations.

From US 6,115,708 a method for refining the initial conditions 
for clustering with applications to small and large database 
clustering is known. It is disclosed how this method is applied
to the popular K-means clustering algorithm and how refined 
initial starting points indeed lead to improved solutions. The 
technique can be used as an initializer for other clustering 
solutions. The method is based on an efficient technique for 
estimating the modes of a distribution and runs in time 
guaranteed to be less than overall clustering time for large 
data sets. The method is also scalable and hence can be 
efficiently used on huge databases to refine starting points for 
scalable clustering algorithms in data mining applications. 

From US 6,100,901 a method for visualizing a multi-dimensional
data set in which the multi-dimensional data set is clustered
into k clusters, with each cluster having a centroid, is known.
Either two distinct current centroids or three distinct
non-collinear current centroids are selected. A current
2-dimensional cluster projection is generated based on the
selected current centroids. In the case when two distinct
current centroids are selected, two distinct target centroids
are selected, with at least one of the two target centroids
being different from the two current centroids.



From US 5,857,179 a computer method for clustering documents and 
automatic generation of cluster keywords is known. An initial 
document by term matrix is formed, each document being 
represented by a respective M dimensional vector, where M 
represents the number of terms or words in a predetermined 
domain of documents. The dimensionality of the initial matrix is 
reduced to form resultant vectors of the documents. The 
resultant vectors are then clustered such that correlated 
documents are grouped into respective clusters. For each 
cluster, the terms having greatest impact on the documents in 
that cluster are identified. The identified terms represent key 
words of each document in that cluster. Further, the identified 
terms form a cluster summary indicative of the documents in that 
cluster. 



Summary of the invention 

A principal object of the present invention is to provide a 
method, data processing system and computer program product for 
data clustering and quality determination such that the 
qualities of clustering results can be compared on an objective 
basis. The quality index for a clustering result obtained in 
accordance with the invention is independent of the clustering 
algorithm used. 



Rather than relying on the clustering algorithm itself for 
quality determination the invention relies on a statistical 
analysis of the clustering result to determine the quality of 
the clustering. The statistical analysis uses a comparison of 
the foreground and background frequencies of buckets. The 
comparison results in a statistical parameter used to calculate 
a quality index.

According to a preferred embodiment the quality index is
normalized such that even if different sets of data are used as
a basis for different clustering operations the results of
the clustering are still comparable based on the objective quality
index.

According to a further preferred embodiment of the invention a 
clustering operation is carried out by performing a data 
clustering operation based on a variety of different clustering 
algorithms either in parallel or sequentially, determining the 
qualities of the respective clustering results and ranking the 
results accordingly. The result with the highest quality index 
can be considered the overall result of the clustering 
operation. 

Further the invention provides a clustering algorithm relying on 
an objective quality index to be optimized in a number of 
iterations. This algorithm outputs a resulting quality index for 
its clustering result which is objective and can be compared to 
corresponding other results. 

A method of the invention is advantageously implemented in a 
data processing system by means of a corresponding computer 
program. If a number of different clustering algorithms is used 
it is advantageous to assign a dedicated processing unit of the 
data processing system to each clustering algorithm for the 
purpose of parallel processing. This has the advantage of 
minimizing the processing time required. 

Brief description of the drawings 

The present invention together with the above and other objects 
and advantages may best be understood from the following 
detailed description of the preferred embodiments of the 
invention illustrated in the drawings, wherein

Fig. 1 is a schematic representation of the structure of a
cluster j;

Fig. 2 is a flow chart illustrating a preferred embodiment of 
the determination of a quality index; 

Fig. 3 is a flow chart illustrating the utilization of different 
clustering algorithms in parallel; 

Fig. 4 is a flow chart illustrating a clustering algorithm
relying on an objective criterion to be optimized in a
number of iterations; and



Fig. 5 is a block diagram showing the structure of a data 
processing system. 



Fig. 1 shows a number of records R-j1, R-j2, ... R-j5. Each record
has a number n of fields. Each field stores a variable l. Each
variable can take a certain number of states. Each such state is
called a bucket, i.e. a value the variable can take. There are
different types of variables such as logical, categorical, and
numerical variables. An example of a categorical variable is the
gender of a person. In this case the two corresponding buckets
are "male" and "female". In the case of numerical variables
typically the spectrum of the numeric range is separated into
sub-ranges, each sub-range defining a bucket of the variable.
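The separation of a numeric range into buckets can be illustrated by a short Python sketch. The variable ("age") and the sub-range boundaries are assumptions chosen only for illustration; the description does not prescribe any particular boundaries.

```python
from bisect import bisect_right

# Hypothetical sub-range boundaries for a numerical "age" variable
# (an assumption for illustration, not taken from the description).
AGE_BOUNDS = [18, 30, 50, 65]


def bucket_index(value, bounds):
    """Return the 1-based bucket index i of a numerical value.

    Bucket 1 covers values below bounds[0]; the last bucket,
    len(bounds) + 1, covers values at or above bounds[-1].
    """
    return bisect_right(bounds, value) + 1


ages = [12, 18, 29, 47, 70]
buckets = [bucket_index(a, AGE_BOUNDS) for a in ages]
print(buckets)
```

Four boundaries give five buckets, so this variable would contribute five states to the frequency analysis.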

The raw data on which the data clustering operation is applied 
consists of a large volume of such structured data records. The 
result of a clustering operation yields a number k of clusters 
of which the cluster j is schematically depicted in the example 
of Fig. 1.

The variable l=2 has the value A in the record R-j1. In other
words the bucket i=1 for the variable l=2 in the record R-j1
equals A. Other than A the variable l=2 can also take the values B
or C, i.e. the buckets i=2 and i=3 for this
variable l=2 are B and C, respectively. For example in the record
R-j3 of the cluster j the variable l=2 has the bucket C (i=3) and
in the record R-j4 of the cluster j the variable l=2 has the
bucket A again (i=1).

With respect to Fig. 2 a preferred embodiment of a method
for determining a quality index for a clustering result is now
explained in more detail. In step 20 the relative foreground
frequency of a bucket i of the variable l is determined for
cluster j. For example, the relative foreground frequency of the
bucket i=1 for the variable l=2 in the cluster j of the example
shown in Fig. 1 is 3/5, as the bucket i=1 for this variable,
which is A, occurs three times in the total of the five records
contained in the cluster j.

In the next step 21 the relative background frequency of the
bucket i of the variable l is determined for all clusters, i.e.
for the entire set of records contained in the clustered data.
In the example considered with respect to Fig. 1 this is done by
determining the number of occurrences of the bucket i=1 for the
variable l=2 in all records and dividing the absolute number of
occurrences by the number of all records.
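Steps 20 and 21 can be sketched in Python as follows. The values of cluster "j" follow Fig. 1 only as far as the description states them (A in R-j1, C in R-j3, A in R-j4, three occurrences of A in five records); the remaining values and the second cluster are invented for illustration, and the function name is hypothetical.

```python
from collections import Counter
from fractions import Fraction


def relative_frequencies(values):
    """Map each bucket to its relative frequency among the given values."""
    counts = Counter(values)
    total = len(values)
    return {bucket: Fraction(n, total) for bucket, n in counts.items()}


# Values of the variable l=2 per cluster. Entries not stated in the
# description (R-j2, R-j5 and the whole cluster "other") are assumptions.
clusters = {
    "j": ["A", "B", "C", "A", "A"],
    "other": ["B", "B", "C", "A", "C"],
}

# Step 20: relative foreground frequencies, one table per cluster.
foreground = {name: relative_frequencies(vals) for name, vals in clusters.items()}

# Step 21: relative background frequencies over all records.
all_values = [v for vals in clusters.values() for v in vals]
background = relative_frequencies(all_values)

print(foreground["j"]["A"])  # 3/5
print(background["A"])       # 2/5
```

With these assumed records the bucket A is over-represented in cluster j (3/5 against a background of 2/5), which is the kind of difference the comparison step evaluates.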

In step 22 a comparison value is determined to compare the
relative foreground and background frequencies resulting from
steps 20 and 21. The comparison can be performed by subtracting
the relative background frequency from the relative foreground
frequency for a given bucket i of a given variable l. This is
reflected in the following equation:

(1) $f_{j,i,l} - v_{i,l}$

where $f_{j,i,l}$ is the relative foreground frequency of the bucket i
of the variable l in the cluster j and $v_{i,l}$ is the relative
background frequency of the bucket i of the variable l. This
subtraction yields a parameter which is representative of the
differentiation of the cluster j in comparison to all other
clusters as far as the bucket i of the variable l is concerned.
As the result of the subtraction can be negative it is
advantageous to either square the result,

(2) $(f_{j,i,l} - v_{i,l})^2$

or to determine the absolute value:

(3) $|f_{j,i,l} - v_{i,l}|$.

These comparison values are determined and then added for all
buckets i in all clusters j for a given variable l according to
the following equation:

(4) $r_l = \sum_{j=1}^{k} \sum_{i} (f_{j,i,l} - v_{i,l})^2$



The resulting parameter $r_l$ is multiplied with a factor in step
24. The factor is determined in steps 25 and 26:

In step 25 the optimal number of clusters optClust is
determined. For example the optimal number of clusters can be
defined to be equal to the maximum number of buckets of any of
the variables. It is advantageous to set a threshold value for
the optimal number of clusters in case one of the variables has
a very large number of buckets or if the maximum number of
clusters is dictated by the purpose of the clustering operation.
For example if the clustering is performed to identify
demographic groups of people for group oriented advertisement
typically not more than ten clusters corresponding to ten
different marketing campaigns or segments are desirable.
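Step 25 can be sketched as a small Python function. The function and parameter names are hypothetical, and the bucket counts in the usage lines are made-up values; only the rule itself (maximum number of buckets, optionally capped by a threshold) comes from the description.

```python
def optimal_cluster_count(buckets_per_variable, threshold=None):
    """Return optClust: the maximum number of buckets of any variable,
    capped at the threshold when a threshold is given."""
    opt_clust = max(buckets_per_variable)
    if threshold is not None:
        opt_clust = min(opt_clust, threshold)
    return opt_clust


# Hypothetical data set with three variables having 2, 3 and 40 buckets.
uncapped = optimal_cluster_count([2, 3, 40])
capped = optimal_cluster_count([2, 3, 40], threshold=10)
print(uncapped, capped)
```

The threshold of ten mirrors the marketing example, where more than ten segments would not be useful regardless of how many buckets a single variable has.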

In step 26 the factor is calculated based on the optimal number
of clusters and the actual number of clusters. The actual number
of clusters is the number of clusters resulting from the
clustering operation.

In step 27 a division by the number of variables n is performed.
The summation of the parameter $r_l$ for all variables l yields the
quality index QI according to the following equation:

(5) $QI = \frac{1}{n} \sum_{l=1}^{n} r_l \cdot \frac{\min[\mathrm{optClust}, \mathrm{NbrClust}]}{\max[\mathrm{optClust}, \mathrm{NbrClust}]}$

where min[optClust, NbrClust] is the smaller of optClust and
NbrClust and max[optClust, NbrClust] is the larger.

The quality index QI is outputted in step 28. 
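The steps 20 to 28 can be put together in a minimal Python sketch. It assumes that the squared difference is used as the comparison value and that every cluster contains at least one record; the data layout (a cluster as a list of records, a record as a tuple of bucket labels) and all names are assumptions made for illustration.

```python
def quality_index(clusters, opt_clust):
    """Quality index of a clustering result.

    clusters:  list of clusters; each cluster is a non-empty list of
               records; each record is a tuple with one bucket label
               per variable.
    opt_clust: the optimal number of clusters (optClust).
    """
    all_records = [rec for cluster in clusters for rec in cluster]
    n = len(all_records[0])  # number of variables
    total = 0.0
    for l in range(n):
        r_l = 0.0
        for bucket in {rec[l] for rec in all_records}:
            # Relative background frequency over all records.
            v = sum(rec[l] == bucket for rec in all_records) / len(all_records)
            for cluster in clusters:
                # Relative foreground frequency within this cluster.
                f = sum(rec[l] == bucket for rec in cluster) / len(cluster)
                r_l += (f - v) ** 2  # squared comparison value (assumption)
        total += r_l
    nbr_clust = len(clusters)
    factor = min(opt_clust, nbr_clust) / max(opt_clust, nbr_clust)
    return total / n * factor


# Two toy clusterings of the same four one-variable records.
separated = [[("A",), ("A",)], [("B",), ("B",)]]
mixed = [[("A",), ("B",)], [("A",), ("B",)]]
print(quality_index(separated, opt_clust=2))  # 1.0
print(quality_index(mixed, opt_clust=2))      # 0.0
```

The clustering that separates the buckets completely scores higher than the one whose clusters mirror the background distribution, and a mismatch between optClust and the actual number of clusters scales the index down.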

According to a further preferred embodiment of the invention a 
normalizing value is determined to make the quality index 
independent of the data to which the clustering operation is 
applied. This has the advantage that even if clustering 
operations are performed on a different set of data still the 
quality of the results is comparable. The normalizing value $o_l$
for a given variable l is determined in accordance with the
following equation:






DE9-2000-0057 



(6) $o_l = \sum_{i} (1 - v_{i,l})^2 + (k - 1) \sum_{i} (0 - v_{i,l})^2$



The equation 6 corresponds to the above equation 4 for the case 
of an imaginary situation where in one of the clusters the 
relative foreground frequency of a bucket is equal to one and 
equal to zero for all other clusters. In other words: All 
records containing the bucket are concentrated in the same 
cluster. This cluster corresponds to the first summation term in 
equation 6; all the other clusters are represented by the second 
summation term multiplied by the number of clusters k minus 1. 
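The normalizing value can be sketched in Python under two assumptions: that the squared difference is used as the comparison value, and that equation 6 has the structure described above, i.e. one summation term for the cluster with foreground frequency one plus (k - 1) times a summation term for the clusters with foreground frequency zero. The function name is hypothetical.

```python
def normalizing_value(background, k):
    """Normalizing value for one variable.

    background: relative background frequencies of the variable's buckets.
    k:          number of clusters.
    """
    # Imaginary cluster in which the foreground frequency equals one.
    first = sum((1 - v) ** 2 for v in background)
    # Each of the remaining k - 1 clusters has foreground frequency zero.
    second = sum((0 - v) ** 2 for v in background)
    return first + (k - 1) * second


# Two equally frequent buckets, two clusters.
print(normalizing_value([0.5, 0.5], k=2))  # 1.0
```

For two equally frequent buckets and two clusters the value is 1.0, so under these assumptions a clustering that separates the two buckets completely would obtain a normalized contribution of one for this variable.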

This way the normalized quality index is determined in
accordance with the following equation:

(7) $QI = \frac{1}{n} \sum_{l=1}^{n} \frac{r_l}{o_l} \cdot \frac{\min[\mathrm{optClust}, \mathrm{NbrClust}]}{\max[\mathrm{optClust}, \mathrm{NbrClust}]}$

Fig. 3 shows an example of an application of the method of Fig.
2 for performing a clustering of structured data 30 comprising
records similar to the records of Fig. 1. The clustering
algorithms CL 1, CL 2, ... CL q are applied to the data 30. This
yields the clustering results RES 1, RES 2, ... RES q. For each of
the results a corresponding quality index QI 1, QI 2, ... QI q is
determined in accordance with the method of Fig. 2. This is done
by means of parallel data processing in the steps 31, 32 and 33,
respectively.

In step 34 the quality indices QI 1, QI 2, ... QI q are evaluated
by numeric comparison. The numeric comparison of the quality
indices results in an ordered list of the quality indices
corresponding to a ranking of the respective results. The
comparison of the quality of the results is made possible by the
invention because it allows an objective quality index to be
determined for each result purely based on a statistical analysis of
the result without relying on the clustering algorithm used to
obtain the result.

The ranking of the result is outputted in step 35. The result 
with the highest quality index QI can be considered the overall 
end result of the data clustering operation of Fig. 3. 
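The evaluation and ranking of steps 34 and 35 amount to sorting the results by their quality index. The following Python sketch uses made-up quality index values for three algorithms; the labels and numbers are assumptions for illustration only.

```python
# Hypothetical quality indices QI 1 ... QI 3 for the results of CL 1 ... CL 3.
quality_indices = {"CL 1": 0.42, "CL 2": 0.77, "CL 3": 0.58}

# Step 34: numeric comparison yields an ordered list (highest index first).
ranking = sorted(quality_indices.items(), key=lambda item: item[1], reverse=True)

# Step 35: the top-ranked result is taken as the overall end result.
best_algorithm, best_qi = ranking[0]
print(ranking)
print(best_algorithm, best_qi)
```

Because the index does not depend on how a result was produced, results from unrelated algorithms can be placed in one list in this way.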

With respect to Fig. 4 a clustering method based on the
objective quality index of the invention is shown in more
detail. The clustering method is applied to a set of structured
data 40 comprising records substantially similar to the example
of Fig. 1. In step 41 a convenient initial set of clusters is
selected. This can be done by using any of the known clustering
methods. In step 42 the quality index Q(initial) for the initial
set of clusters is calculated in accordance with equation (5) or
(7).

In step 43 the initial set of clusters is modified by moving one
or more records from their clusters to other clusters. In step
44 the quality index Q(modified) for the modified set of
clusters is calculated in accordance with equation (5) or (7).

In step 45 it is decided whether the quality index Q(modified)
is greater than the quality index Q(initial). If this is not the
case, this implies that the quality of the clustering did not
improve. As a consequence the modification previously performed
in step 43 is reversed in step 46 and the control returns to
step 43 to perform a different modification.

In case the result of step 45 is that Q(modified) is in fact
greater than Q(initial), and thus the quality of the clustering
increased, the control goes to step 47. In step 47 it is decided
whether the predetermined number of iterations has been reached.
If this is the case the execution of the program stops in step 48.
If the contrary is the case, in step 49 the modified set of clusters
is declared to be the initial set of clusters for a further
iteration step. This way the quality of the clustering is
gradually increased until it reaches an ideal value or the
operation is stopped after a predetermined number of iterations.
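The iteration of Fig. 4 can be sketched in Python as a simple hill-climbing loop. Several points are assumptions of this sketch and not taken from the description: one randomly chosen record is moved per modification, clusters are never emptied, the squared difference serves as the comparison value, and the loop is bounded by a total number of attempts (accepted or rejected) so that it always terminates. All names are hypothetical.

```python
import random


def quality_index(clusters, opt_clust):
    """Non-normalized quality index with squared comparison values."""
    records = [rec for cluster in clusters for rec in cluster]
    n = len(records[0])
    total = 0.0
    for l in range(n):
        for bucket in {rec[l] for rec in records}:
            v = sum(rec[l] == bucket for rec in records) / len(records)
            for cluster in clusters:
                f = sum(rec[l] == bucket for rec in cluster) / len(cluster)
                total += (f - v) ** 2
    nbr = len(clusters)
    return total / n * min(opt_clust, nbr) / max(opt_clust, nbr)


def improve(clusters, opt_clust, attempts=200, seed=0):
    """Move single records between clusters, keeping only improvements."""
    rng = random.Random(seed)
    clusters = [list(c) for c in clusters]             # step 41 (copy input)
    q_initial = quality_index(clusters, opt_clust)     # step 42
    for _ in range(attempts):
        src, dst = rng.sample(range(len(clusters)), 2)
        if len(clusters[src]) < 2:
            continue                                   # never empty a cluster
        rec = clusters[src].pop(rng.randrange(len(clusters[src])))  # step 43
        clusters[dst].append(rec)
        q_modified = quality_index(clusters, opt_clust)  # step 44
        if q_modified > q_initial:                     # step 45
            q_initial = q_modified                     # step 49
        else:
            clusters[dst].pop()                        # step 46: reverse move
            clusters[src].append(rec)
    return clusters, q_initial


start = [[("A",), ("B",)], [("A",), ("B",)]]
result, q_final = improve(start, opt_clust=2)
print(result, q_final)
```

Only moves that raise the index are kept, so the returned index is never lower than that of the starting clusters. As with any greedy search, the result may be a local rather than a global optimum.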



Fig. 5 shows a schematic block diagram of a preferred embodiment
of a data processing system in accordance with the invention.
The data processing system has a database 50 for storage of
structured data. The database 50 is connected to a number of
parallel processing units P1, P2, P3 and P4 via data bus 51. In
each of the processing units P1 to P4 a data clustering
operation is performed based on a variety of data clustering
algorithms. The corresponding results are outputted to a control
program stored in memory 52. The control program determines a
quality index for each clustering result obtained by the
parallel processing units P1 to P4. This is done in accordance
with the preferred embodiments of Fig. 2 and Fig. 3. The
clustering result with the highest quality index value is
selected by the control program and outputted as result 53.
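The arrangement of Fig. 5 can be imitated in software with a worker pool. In the following Python sketch a thread pool stands in for the processing units P1 to P4, the two "algorithms" are trivial placeholders, and the quality function is a toy stand-in rather than the quality index of Fig. 2; all of these are assumptions made to keep the example short.

```python
from concurrent.futures import ThreadPoolExecutor


def run_all(algorithms, data, quality_index):
    """Run every clustering algorithm on its own worker, score each
    result and return (best_result, best_quality_index)."""
    with ThreadPoolExecutor(max_workers=len(algorithms)) as pool:
        results = list(pool.map(lambda algorithm: algorithm(data), algorithms))
    scored = [(quality_index(result), result) for result in results]
    best_qi, best_result = max(scored, key=lambda pair: pair[0])
    return best_result, best_qi


# Placeholder "clustering algorithms" (assumptions for illustration).
def split_in_halves(data):
    middle = len(data) // 2
    return [data[:middle], data[middle:]]


def split_alternating(data):
    return [data[0::2], data[1::2]]


# Toy stand-in for a quality measure: tighter clusters score higher.
def toy_quality(clusters):
    return -sum(max(c) - min(c) for c in clusters)


best, best_qi = run_all([split_in_halves, split_alternating], [1, 2, 9, 10], toy_quality)
print(best, best_qi)
```

For compute-bound clustering algorithms a process pool or separate machines would correspond more closely to dedicated processing units; the control flow of collecting, scoring and selecting results stays the same.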

1. A method for determining the quality of a result of a
clustering data processing operation, the result comprising a
set of clusters, a cluster having a set of buckets for each
variable, the method comprising the steps of:

a) determining the foreground frequency of a bucket within a 
given cluster; 



b) determining the background frequency of the bucket with 
respect to all clusters; 

c) comparing the foreground and background frequencies; and 

d) determining a quality index based on the comparison. 



2. The method of claim 1, the step of comparing comprising
subtracting the relative foreground and background
frequencies.

3. The method of claim 1 or 2 further comprising squaring the
result of the comparison.

4. The method of any one of the preceding claims further
comprising the steps of

a) determining an optimal number of clusters; and 

b) comparing the optimal number of clusters to the actual 
number of clusters resulting from the clustering data 
processing operation.

5. The method of claim 4 wherein the optimal number of clusters
is determined by the maximum number of buckets for a
variable.



6. The method of claim 5 wherein the optimal number of clusters 
is set to a threshold value in case the maximum number of 
buckets is greater than the threshold. 



7. The method of any one of claims 4, 5 or 6 further comprising 

a) determining a factor based on the optimal number of 
clusters and the actual number of clusters; and 

b) multiplying the result of the comparison of the relative 
foreground and background frequencies with the factor. 



8. The method of any one of the preceding claims further 
comprising 

a) determining a normalizing value being independent of any 
correlations between fields of the data on which the data 
processing operation is applied; and 

b) normalizing the result of the comparison of the 
foreground and background frequencies by means of the 
normalizing value. 



9. The method of claim 8 wherein the normalizing value is
determined by

a) comparing the background frequencies of the buckets with
an imaginary cluster having a foreground frequency of the
bucket equal to one;

b) comparing the background frequencies of the buckets with
an imaginary cluster having a foreground frequency of the
bucket equal to zero; and

c) summing the results of the corresponding comparison
values.

10. A method for data clustering comprising the steps of: 

a) performing a number of data clustering operations;

b) determining a quality index for each result of the data 
clustering operations in accordance with a method of any 
of the claims 1 to 9; 

c) selecting the result with the highest quality index as an 
end result of the data clustering. 



11. A method for data clustering comprising the steps of: 

a) selecting an initial set of clusters; 

b) determining a quality index for the clusters in 
accordance with a method of any of the claims 1 to 10; 

c) performing a number of iterations to improve the quality 
index.

12. The method of claim 10 further comprising the steps of:

a) moving at least one record of at least one of the 
clusters to another cluster; 

b) determining the quality index for the modified clusters 
in accordance with a method of any of the claims 1 to 9; 

c) using the modified clusters as a new initial set of 
clusters in case the quality index improved. 

13. A computer program product stored on a computer usable
medium, comprising a computer readable program means for
causing a computer to perform a method according to any of
the preceding claims 1 to 12 when the program is run on the
computer.

14. A data processing program for execution in a data processing
system comprising software code portions for performing a
method according to any of the preceding claims 1 to 12 when
the program is run on the computer.

15. A data processing system comprising means adapted for
carrying out the steps of the method according to any of the
preceding claims 1 to 12.

16. The data processing system of claim 15 having means for
parallel execution of data clustering programs, means to
determine the quality of the result of corresponding data
processing operations and means for selecting one of the
results having the highest quality index as an end result.

A method for determining an objective quality index for the
result of a clustering operation is disclosed. This method can 
be used to evaluate the result of different clustering 
algorithms or can itself be the basis for an iterative 
clustering algorithm. The invention can be implemented by means 
of a computer program running on a data processing system which 
can have parallel processing units for performing different 
clustering algorithms in parallel. 